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Abstract — In this paper, we consider the periodic reference 
tracking problem in the framework of batch-mode reinforcement 
learning, which studies methods for solving optimal control 
problems from the sole knowledge of a set of trajectories. 
In particular, we extend an existing batch-mode reinforcement 
learning algorithm, known as Fitted Q Iteration, to the periodic 
reference tracking problem. The presented periodic reference 
tracking algorithm explicitly exploits a priori knowledge of the 
future values of the reference trajectory and its periodicity. 
We discuss the properties of our approach and illustrate it on 
the problem of reference tracking for a synthetic biology gene 
regulatory network known as the generalised repressilator. This 
system can produce decaying but long-lived oscillations, which 
makes it an interesting system for the tracking problem. In our 
companion paper we also take a look at the regulation problem 
of the toggle switch system, where the main goal is to drive the 
systems states to a specific bounded region in the state space. 

Index Terms — batch-mode reinforcement learning; reference 
tracking; fitted Q iteration; synthetic biology; gene regulatory 
networks; generalised repressilator 



I. Introduction 

There are many problems in engineering, which require 
forcing the system to follow a desired periodic reference tra- 
jectory (e.g., in repetitive control systems [1|). One approach 
to solve these problems is to define first a distance between 
the state of the system n t and the desired reference point r t at 
time t. Then explicitly seek for a control policy that minimises 
this distance for all t subject to problem specific constraints. In 
the case of systems in the Euclidean state-space, for example, 
the Euclidean distance can be used. 

In this paper, we are interested in solving reference tracking 
problems of discrete-time systems using an optimal control 
approach. In our setting, the dynamics of the system is only 
known in the form of one-step system transitions. A one-step 
system transition is a triplet {n, u,n + }, where n + denotes 
a successor state of the system in state n subjected to input 
u. Inference of near-optimal policies from one-step system 
transitions is the central question of batch-mode reinforcement 
learning Q, J3). Note that reinforcement learning [4| focuses 
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Fig. 1. A schematic depiction of an exogenously controlled generalised 
repressilator. Numbered circles represent different genes. An arrow with a 
flat end connecting two circles represents a repressive action from one gene 
product onto the other one. For example, gene 2 is repressed by the protein 
product of gene 1. 



on a more general problem of policy inference from interac- 
tions with the system. Usually, in batch-mode reinforcement 
learning, very few assumptions are made on the structure 
of the control problem. This gives batch-mode reinforcement 
learning algorithms a high degree of flexibility in comparison 
with other control methods. Batch-mode reinforcement learn- 
ing has had numerous applications in many disciplines such as 
engineering J5J , HIV treatment design [6], and medicine Q- In 
this paper, one batch-mode reinforcement learning algorithm 
is considered, namely the Fitted Q Iteration algorithm (8). 
This algorithm focuses on the computation of a so called Q 
function, which is used to compute a near-optimal control 
policy. The algorithm computes the Q function using an 
iterative procedure based on the Bellman equation, a regression 
method and the collected samples of system trajectory. 

The focus of this paper is an extension of the Fitted Q 
Iteration to the reference tracking problem. In our extension, 
we explicitly take into account the future reference trajectory 
values and the period T. Moreover, regression is separated 
into T independent regression problems, which can save 
computational effort. The algorithm is implemented in Python 
using open-source packages and libraries (9j, iflOl . ifTTl . Ifl21 . 

To illustrate our reference tracking method, we consider 
its application to a synthetic biology gene regulatory network 
known as the generalised repressilator. The repressilator is a 
ring of three mutually repressing genes pioneered in [13], and 
theoretically generalised to rings consisting of more than three 
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genes in [14J. Repression is an interaction between two genes, 
such that the protein product of the repressing gene prevents 
protein expression of the repressed gene. Or simply put, where 
one gene can turn off the other. A generalised repressilator 
with a sufficiently large, even number of genes (such as the 
four-gene ring in Figure [T| can exhibit decaying but very long- 
lived oscillations fl31 . The objective of this paper is to force 
one or several protein concentrations to follow a priori defined 
reference trajectories, namely: sinusoids and ramps. A deeper 
background on synthetic biology and its control applications 
is provided in our companion paper fl6l . 

The paper is organised as follows. Section [TT] covers batch- 
mode reinforcement learning. In Section |TlTJ an extension of 
the Fitted Q Iteration to the reference tracking problem is 



Algorithm 1 Fitted Q iteration 



presented. In Section IV our illustrative application, reference 
tracking for the generalised repressilator system, is given. 
Finally, Section [V] concludes the paper. 

II. Fitted Q Iteration 
A. Problem Formulation 

Consider a deterministic discrete-time dynamical system 



n t+ i = f(n t ,u t ) 



(1) 



where the bold values n stand for vectors with elements n\ n t 
is the state of the system at time t, and u t is a control input, 
which at each time t belongs to a compact set U. Consider 
also, associated with this dynamical system, an optimal control 
problem, which is defined in terms of the minimisation of an 
infinite sum of discounted costs c(n, u): 



oo 



nunLT' c(m, fj,(rii)) 
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where /i(-) is a mapping from the state-space onto U, which is 
called the feedback control policy, and 7 is a positive constant 
smaller than one, which is called the discount factor. For the 
purpose of this paper, it is assumed that c(v) is a given 
function. The goal is to compute the optimal policy based 
only on one-step transitions of the system (TTJ. One-step system 
transitions are given as a set T = {rig, ttj, nf}f^[, where n+ 
denotes a successor state of the system in state n; subjected 
to input ui (if the function /(•,•) is known then n+ is simply 
equal to f(m,ui)). 

B. Algorithm 

The central object in Fitted Q Iteration is the Q function, 
which is defined as follows: 

00 

Q(n t ,u t ) = c(n t ,u t ) + mm V" 7 4 ~*c(ni, ju(n;)). 

i=t+i 

The main idea of the approach is to exploit the celebrated 
Bellman equation 

Q(n, u) = c(n, u) + 7 min Q{f(n, u), u'), (2) 

which provides a convenient expression for the computation 
of the optimal feedback control policy: 



Inputs: Set of one-step system transitions T = 
{ni,ui,nfyf =v cost c(-, •), stopping criterion 
Outputs: Policy /t*(n) 
k <- 

QoM <-c(.,.) 

repeat 

k <- k + 1 

Compute |5]) to obtain the values of Qfc(v) for all 
{ni,ui} in T 

Estimate the function Qk(n,u) using a regression al- 
gorithm with input pairs (rii,Ui) and function values 
Q k (ni,ui). 

until stopping criterion is satisfied 

Compute the policy according to |6) 



The Bellman equation can be theoretically solved using an 
iterative procedure 

Q k (n,u) = c(n,u) +7 min Q fe _ 1 (/(n, u), u'), 

u'EU 

where Qq = c and Qrx, is the unique solution to |2]) due 
to the fact that T(Q) = c + 7 minuet/ Q is a contraction 
mapping (cf. fl3j). However in the continuous state-space case, 
this iterative procedure is hard to solve, especially when only 
the one-step system transitions in T are given. The Fitted Q 
iteration algorithm computes, from the sole knowledge of T 
a sequence of functions Qi, Q2, ■ ■ ■ that approximates the 

sequence Q\, Q 2 , Let Q = c and for every (ni,ui,nf) 

in T compute: 

Qi{n h ui) = c{n h ui) + 7 min Qo (4) 

This expression gives Qi only for ni, ui in T, while the entire 
function Qi(v) i s estimated using a regression algorithm 
(e.g., EXTRA Trees El). Now the iterative procedure is 
derived by generalising Q as follows: 



Q k {n h ui) = c(m,ui) 



7minQfc- 



(5) 



where at each step Qfe(-,-) is estimated using a regression 
algorithm. If the iteration procedure stops at the iteration 
number k, an approximate policy can be computed as follows: 



jj*(n) = argmin Qk(n,u) 



(6) 



H*{n) — argmin Q(n, u) 



(3) 



A simple stopping criterion can be, for example, an a priori 
fixed maximum number of iterations N^- The value Na is 
chosen such that 7^" is sufficiently small and, hence, the 
values Qk{ni,ui) are not modified significantly for k larger 
than Nn. Other, more advanced, stopping criteria are described 
in JSj. The resulting iterative method is outlined in Algorithm 
[T] A major property of the fitted Q iteration algorithm is 
convergence. This is understood as convergence of Qk to a 
fixed state-action value function Q* given a fixed set T as 
k grows to infinity. It was shown in [8], that under certain 
conditions Algorithm [T] converges and it is possible to estimate 
in advance the distance between Q* and the iterate Qk- 
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III. Periodic Reference Tracking Using the Fitted 
Q Iteration Algorithm 

A. Problem Formulation 
Consider, a system: 

rat+i = f(n t , ut) 
rt+i = g(r t ) 

where the function /(■,•) is unknown. The function g(-), 
however, is a known periodic function of period T, and the 
variable r takes only a finite number of values {vi}f =1 . Hence, 
g(vi) is equal to Wj+i for all i smaller than T, and g{vr) 
is equal to Di. The reference tracking problem is defined as 
follows: 

oo 

min j l ^ t c(d(n i ,r l ),n i , m) 
m(v) . 

subject to system dynamics (|7J, and 

K n t,r t ) = u t eU 

where c is a given instantaneous cost function, d(-,-) is a 
function defining the distance between the current state m and 
reference t%, and fi(-, •) is a feedback control policy. In order 
to track the reference r, reducing the value of the cost c must 
reduce the distance d. The instantaneous cost can optionally 
depend on the control signal u and the states n in order to 
provide additional constraints in the state-space and/or control 
action space. As for the Fitted Q Iteration, the control policy 
is inferred based solely on the trajectories given in the form of 
one-step system transitions T = {ni,ui,n^}f^, where n ; + 
denotes a successor state of the system in state n; subjected 
to input 

B. Algorithm 

The variable r can be seen as an additional state in the ex- 
tended state-space {n, r}. Based on this extended state-space 
we can derive the Bellman equation for the tracking problem 
and the following iterative procedure for the computation of 
the Q function: 

Qk(n, r, it) = c(d(n, r),n, u)+ 

min Q k ^ 1 (n + ,g(r),u') Vn,r,u (9) 

where Qq is equal to c. It can be shown that this iterative 
procedure has a unique solution Q* based on a similar con- 
traction mapping argument as in the previous section. Hence 
the optimal policy in the extended state-space is computed as: 

/i(n, r) = min Q* (n, r, u) 

The input set to the algorithm normally consists of the one- 
step system transitions T = {ni,ui,nf}fj^. However, since 
our state-space has been extended to include r, T should 
now include r as well. Since the time evolution of r is 
known a priori we can simply modify the set as follows 
TT = {n h v i7 ui,n+ ,g(vi)}ij, where i = 1, ...,T and 
1 = 1,..., # F. This will effectively copy T times the training 
set J- . Now given this modification, we can proceed to the 



computational procedure. As before Qq is equal to c and the 
next iterates can be obtained according to: 

Qk{n u v h u{) = c(d(n l ,v l ),n l ,u l )+ 

minQ k -i(nt,g{vi),u') V/, (10) 

According to the Fitted Q Iteration framework, every function 
Qk(-, ■, ■) must be estimated by a regression algorithm, which 
uses the input set {ni,Vi,ui}i t i and the corresponding values 
of the approximated function Qk(ni,Vi,ui). This input set 
can grow significantly, if the period T of the reference 
trajectory is large, which can render a regression algorithm 
computationally intractable. For example, with only a thousand 
one-step system transitions {ni,Ui,nf}f^ and period T 
equal to 200, the total number of samples {m,Vi,ui}i t i 
is equal to 200 000. Therefore, it is proposed to break up 
the regression of Qk(-,-,-) into T independent regression 
problems, one for every function Qf.(-, Vj, •). This can be done, 
because the evolution of the variable r is known in advance 
and r takes a finite number of values. To make these ideas 
more transparent, (jTOJ is rewritten using a different notation 
as follows: 

Ql'in^ui) = c(d(ni,Vi),ni,ui) + 

minQ^Cn+u) VL v it (11) 

where Qu*(-,-) stands for the function Qk(-, fj, •)• F° r ev " 
ery value Vi, the regression algorithm will approximate the 
function Q\ H {-,-) by using the input set {ni,ui}i and the 
corresponding values (ni,ui). Thus the initial regression 
problem is indeed separated into T independent problems. 

Our approach can be also seen as a modification of the Qk 
function approximator. The first layer of our approximator is a 
deterministic branching according to the values v^. After that a 
regression algorithm is performed to approximate the functions 
QV(') •) as prescribed. Finally, if the iterative procedure has 
stopped at iteration fc, a near-optimal policy can be computed 
as follows: 

/t*(n,r) = miiiC)£(n,it) (12) 

Our approach is outlined in Algorithm [2] Periodicity is a 
crucial assumption, since the period of the reference trajectory 
corresponds to the number of different Q£ functions built in 
this algorithm. Convergence of Algorithm [2] can be established 
using the following lemma. 

Lemma 1: Algorithm [2] converges under similar conditions 
and considerations as the fitted Q iteration algorithm in JS). 
Moreover, the stopping criteria from [ 8 1 can be directly applied 
to Algorithm [2] 

To prove this lemma, we have to make sure that the approx- 
imator of the Qk functions will not break the convergence 
proof in [8]. Only one modification in this approximator is 
made, which is the deterministic branching in its first layer 
according to Vi. It can be shown that this modification does 
not violate the convergence arguments in ||8). 

Note, if we assume that the cost function is time- 
independent, i.e., r is constant and equal to v, and g(v) is 
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Algorithm 2 Reference Tracking Fitted Q Iteration 
Inputs: Sets of one-step system transitions 
T = {ni, ttj, Tii~}t^, function g(-) and reference values 
{ v i}f=i> cost c {d{- 7 ■), ■, ■)> stopping criterion 
Outputs: Policy fi* (n, r) 

Qb(v)<-c(d(-, •),-,•) 
repeat 

k <r- k + 1 

Compute ( fTTj i to obtain the values of Q for all 
{ni,ui} in J 7 and Vi 

Estimate the functions QV* (n, u) for every Vi using 
a regression algorithm with input pairs (nj,ttj) and 
function values (ni,ui). 

until stopping criterion is satisfied 

Compute the policy using ( fT2| ) 



equal to v, equation ( fTTj ) becomes: 
Qk( n h u i) = c (d{n h v),n l ,u l ) + 



min Qfe_ 1 (n+ «') V/ 

which reduces Algorithm [2] to Algorithm [T] without any 
artefacts. Algorithm[T]is applied to control of a gene regulatory 
network in our companion paper |16|. 

C. Other Applications of the Algorithm 

The presented algorithm addresses the problem of optimal 
control of the system |7]), where a part of the dynamics is 
known and a part of the dynamics is to be learned. In our 
setting, the known dynamics describe a reference trajectory, 
which takes only a finite number of values. However, there are 
other problems for which Algorithm [2] can be useful. Consider 
the system, where /(•,•) describes the system dynamics and 
g(-, •) is known: 

n + = f(n, r) 

+ \ < 13 > 

r + = g{r,u) 

The function g(-, •) could be a model of an actuator or some 
dynamics of the system, e.g., a time-delay in the control signal. 
In the latter case, the function g(-, •) delays the application of 
the control action by a number of time samples. If u takes a 
finite number of values, then the problem can be solved using 
Algorithm [2] without major modifications. 

Another application is a system with dynamics /(•,•) and 
a given feedforward compensator g(-, ■), which helps to coun- 
teract the measured disturbance w: 

n + = f(n, u + r) 

r + = g(r, w) 



(14) 



IV. Tracking of Periodic Trajectories by the 
Generalized Repressilator 

A. System Description 

As an illustrative application of our method we consider 
the problem of periodic reference tracking for a six gene 
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Fig. 2. Natural oscillatory behaviour of a generalised repressilator system 
consisting of 6 genes. The blue line represents the time evolution of the protein 
concentration produced by the expression of gene 1 ; the cyan line represents 
the time evolution of the protein concentration produced by the expression of 
gene 2. The oscillations have a period of approximately 150 arbitrary time 
units but are, however, not stable. 



generalised repressilator system. There are two major species 
associated with every gene (the mRNA and protein concen- 
trations), which results in a twelve state system. Throughout 
the section we adopt the following notation: p l denotes the 
protein concentration produced by the translation of mRNA 
m l of gene i. By definition of the generalised repressilator, 
the transcription of mRNA m l is repressed by the previous 
gene expression product p 1 ^ 1 in the network. With a slight 
abuse of notation, we assume that p^ 1 is equal to p n in order 
to model the cyclic structure of the generalised repressilator. 
The dynamics of the generalised repressilator system can be 
described by the following set of deterministic equations: 



i + Op*- 1 ) 2 ' 2 

p l = c^m 1 - c\p l 



c l 2 m l + (^lfriM 1 + Si2b2U 2 



(15) 



where i is an integer from one to six, and 5ij is equal to one, 
if i is equal to j, and equal to zero otherwise. We consider a 
control scheme similar to the one used for the toggle switch 
regulation problem (see our companion paper lfl6l ). where the 
light control signals induce the expression of genes through 
the activation of their photo-sensitive promoters. The control 
signal u 1 only acts on the mRNA dynamics of gene 1, whereas 
the control signal u 2 only acts on the mRNA dynamics of 
gene 2. The influence of the light signals on the rate of 
mRNA production of genes 1 and 2 is denoted by bi and 
&2, respectively. To simplify the system dynamics and as it is 
usually done for the repressilator model lfl3l . we consider the 
corresponding parameters of the mRNA and protein dynamics 
for different genes to be equal. Hence, the trajectories will be 
very similar between the different genes. For the purpose of 
this paper, we chose: 

Vi:4 = 1.6, c l 2 =0.16, 4 = 0.16, 4 = 0.06, 

bi = b 2 = 5 

As shown in [15], this system exhibits a long-lived oscil- 
latory behaviour around an unstable limit cycle as depicted 
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in Figure [2] The "natural" period of these slowly decaying 
oscillations is around 150 arbitrary time units. 

B. Algorithm Parameters and Implementation 

The instantaneous cost function is defined differently based 
on the considered example. In the first case, the objective 
is for the concentration of protein p 2 to track an a priori 
specified reference trajectory. In the other cases, the objective 
is for the concentrations of two proteins p 1 and p 2 to track 
their respective reference trajectories. When tracking of two 
protein concentrations needs to be ensured, the cost function 
will depend on these two protein concentrations p 1 and p 2 
and on the corresponding references r\ and r\ . Note that here 
and in the sequel r\ stands for the i-th element of the vector 
r t . When tracking needs to be ensured for only one protein 
concentration (i.e., p 2 ), the cost function will only depend 
on p 2 and the scalar reference r 2 . In both cases, the cost 
depends on the control signals u % in order to penalise the use of 
light stimuli and thus minimise the metabolic burden caused 
by heterologous gene expression. The cost is defined using 
a distance between the observed state p l and the reference 
trajectories r\: 

c{p, r, u) = 100 ■ oV - r\) 2 + 100 • {p 2 - r 2 ) 2 + 

0.05 ■u 1 + 0.05 -u 2 

where a will be equal to one or zero depending on how many 
reference trajectories we want to track simultaneously. 

The discount factor 7 is equal to 0.75, the choice of 
which is guided by considerations similar to the ones in JS). 
The stopping criterion is simply a bound on the number of 
iterations, which for the purpose of this paper is 30. The set 
of system transitions is generated according to the following 
procedure. 300 system trajectories starting in a random initial 
state are generated. The control actions applied to generate 
the trajectories are random as well. Every trajectory has at 
most 300 samples. These state transitions are then gathered in 
the set T . At every step every Q Vi function is approximated 
using the regression algorithm EXTremly RAndomized Trees 
(EXTRA Trees), which was shown to be an effective regres- 
sion algorithm for the fitted Q iteration framework [17|. The 
parameters for the algorithm are set to the default values from 
0. 

The algorithm is implemented in Python using the machine 
learning (5), parallelisation |10|, graphics [11] and scientific 
computation lfT2l toolboxes. 

C. Results 

1) One sinusoidal reference trajectory, different periods: 
In this example, we are going to force the concentration of 
the protein 2 to track a sinusoid with different periods. The 
parameter a is set to zero and the sinusoid is chosen to 
resemble the natural oscillations in terms of amplitude and 
mean value: 

r 2 t = 8 + 7-sin(Ti/(27r)). 

We are testing the algorithm for the following periods T = 
50, 150, 250. We can increase the concentration of the protein 
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Fig. 3. Tracking a sinusoid in a six gene repressilator. The figure is colour 
coded: lines with the same colour correspond to the same experiment. Each 
solid curve represents the time evolution of the concentration protein 2, which 
attempts to follow the same colour reference trajectory, represented by a 
dashed line. The cyan colour corresponds to the period T = 50, the blue 
colour to T = 150 and the red colour to T = 250. 

2 directly trough application of the control signal u 2 . We 
can also decrease the concentration of the protein 2 through 
increasing the concentration of the protein 1, which acts as a 
repressor for gene 2. The results of several experiments are 
depicted in Figure [3] The natural oscillations have a period 
of approximately 150 arbitrary time units, therefore the blue 
dashed curve is easiest to follow. Using repression by gene 
1, the algorithm manages to find a schedule of light pulses 
that allows to track the dotted red curve, even though the 
period is much larger than 150. However, due to the inherent 
dynamics of the system, the algorithm cannot bring down the 
concentration of protein 2 fast enough to properly track the 
cyan curve. The repression by gene 1 is not strong enough in 
this case to allow accurate tracking. It is important to remark 
that tracking of the trajectory by protein 2 results in damping 
of the oscillations in the other proteins and mRNA dynamics. 

2) Two sinusoidal reference trajectories: For this experi- 
ment we chose two sinusoids (i.e., a is equal to one), where 
the second sinusoid lags behind the first one: 

r] =8 + 7-sin(200t/(27r)) 

r 2 = 8 + 7 • sin(200(< + 200/3)/(2tt)) 

Genes, their protein products and the corresponding ref- 
erence trajectories are colour coded in this example: the 
blue colour corresponds to protein 1 and the cyan colour 
corresponds to protein 2. The "blue" gene represses the "cyan" 
one, hence an increase in the "blue" protein concentration can 
be used to decrease the concentration of the "cyan" protein. 
However, the concentration of the "blue" protein cannot be 
decreased directly. The sinusoids are chosen with approximate 
knowledge of the natural oscillation dynamics: in terms of 
amplitude and mean value of oscillations. Their period is 
chosen equal to 200 time samples. As depicted in Figure |4] 
the algorithm can force the concentration of the first two 
proteins to follow both sinusoids. Moreover, numerous experi- 
ments were conducted starting from different initial conditions, 
which yielded similar tracking results after the initial transient 
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Fig. 4. Tracking two sinusoids in a six gene repressilator. The blue colour 
represents gene 1 in the repressilator, which represses gene 2 represented by 
the cyan colour. The dashed lines represent the reference trajectories; the solid 
lines represent the protein concentration tracking the reference with identical 
colour; finally, the coloured circled dots correspond to time instant at which 
control inputs in the form of light pulses were applied. Due to restrictions 
imposed by the system's dynamics the cyan reference should lag behind the 
blue one and the lag should be large enough to ensure appropriate tracking. 

period had elapsed. It is worth noting the interesting behaviour 
exhibited by the "blue" protein concentration: At some point 
the protein concentration starts growing without any light 
stimulation. This can be explained by the repressilator oscil- 
latory dynamics, where the protein's concentration can grow 
periodically. Moreover, since the "blue" protein concentration 
cannot be decreased directly it can grow significantly as can 
be observed during the time range between 800 and 1000 time 
units. It also means that controlling this system with a larger 
period will be harder due to this fast growth of the "blue" 
protein concentration induced by the dynamics of the system. 

3) Two ramp reference trajectories: The two ramp tracking 
setting is very similar to the two sinusoids tracking setting. 
However, in the situation depicted in upper panel of Figure [5] 
unsuccessful simultaneous tracking of two ramps is occurring. 
The reasons for such a behaviour are the same as above. The 
difference is that they are more pronounced in this case due to 
large time intervals during which a low reference value needs 
to be followed by one of the proteins. Note that the first ramp 
is followed almost perfectly by both protein concentrations. 
However, after a long time interval of low reference value, 
the concentration of the "blue" protein starts to grow due 
the inherent dynamics of the system. With a modified ramp 
such behaviour disappears as depicted in the lower panel of 
Figure [5] Even though tracking is not perfect, the algorithm 
manages to find an approximate solution to the optimal control 
problem, which is quite remarkable given the very limited 
amount of information provided by the input-output data in 
such setting. 

V. Discussion and Conclusion 

In this paper, we have presented a periodic reference track- 
ing reinforcement learning algorithm. The algorithm is based 
on the established fitted Q iteration framework, and inherits 
its properties. The proposed algorithm makes full use of the 
fact that the reference trajectory is known in advance, which 




100 200 300 400 500 600 



Time (a.u.) 

Fig. 5. Tracking the ramps in a six gene repressilator. Similar colour and 
line coding is used as in the figure above. In the upper panel, the algorithm 
finds it hard to keep both genes at low levels due to the inherent constraints 
imposed by the dynamics of the generalised repressilator; hence, the system 
cannot follow these ramp trajectories. Note that even though the first period 
is followed perfectly, the "blue" protein then starts to grow and we have no 
means to decrease its concentration. In the lower panel the ramp is changed 
so that the period of time spent at low concentrations is much shorter. This 
solves the above described issue. 

results in better sample efficiency in comparison with other 
approaches proposed in the literature. 

The algorithm is illustrated on the problem of tracking a 
periodic trajectory for the generalised repressilator system. 
This system has received considerable attention from the 
synthetic and systems biology community due to its ability to 
produce long-lived oscillatory behaviours. The algorithm was 
able to find near optimal solutions to the periodic tracking 
reference problem, even when the period of the reference 
trajectory was smaller than the natural period of oscillation 
of the system. 
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